Set Up Your Spark Project
Required Spark Version
LakeSoul is currently available with Scala version 2.12 and Spark version 3.3.
Set Up (Py)Spark Shell or Spark SQL Shell
To use the spark-shell, pyspark, or spark-sql shells, you need to include LakeSoul's dependencies. There are two approaches to achieve this.
Use Maven Coordinates via --packages
spark-shell --packages com.dmetasoul:lakesoul-spark:2.2.0-spark-3.3
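The same Maven coordinates work for the other shells as well. For example (using the same package version as above):

```shell
# Launch pyspark with LakeSoul's dependencies resolved from Maven
pyspark --packages com.dmetasoul:lakesoul-spark:2.2.0-spark-3.3

# Or launch the Spark SQL shell the same way
spark-sql --packages com.dmetasoul:lakesoul-spark:2.2.0-spark-3.3
```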
Use Local Packages
You can find the LakeSoul packages on our release page: Releases.
Download the jar file and pass it to spark-submit.
spark-submit --jars "lakesoul-spark-2.2.0-spark-3.3.jar"
Alternatively, you can put the jar directly into $SPARK_HOME/jars.
Set Up Java/Scala Project
Include the following Maven dependency in your project:
<dependency>
<groupId>com.dmetasoul</groupId>
<artifactId>lakesoul</artifactId>
<version>2.2.0-spark-3.3</version>
</dependency>
Pass the lakesoul_home Environment Variable to Your Job
If you are using Spark's local or client mode, you can simply export the environment variable in your shell:
export lakesoul_home=/path/to/lakesoul.properties
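The properties file holds the connection settings for LakeSoul's PostgreSQL metadata store. A minimal sketch is shown below; the host, database name, and credentials are placeholders for illustration, so consult the LakeSoul configuration documentation for the keys required by your deployment:

```properties
# Connection settings for the LakeSoul metadata store (values are placeholders)
lakesoul.pg.driver=com.lakesoul.shaded.org.postgresql.Driver
lakesoul.pg.url=jdbc:postgresql://localhost:5432/lakesoul_test?stringtype=unspecified
lakesoul.pg.username=lakesoul_test
lakesoul.pg.password=lakesoul_test
```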
If you are using Spark's cluster mode, in which the driver is also scheduled into the Yarn or K8s cluster, you need to set up the driver's environment:
- For Hadoop Yarn, pass --conf spark.yarn.appMasterEnv.lakesoul_home=lakesoul.properties --files /path/to/lakesoul.properties to the spark-submit command;
- For K8s, pass --conf spark.kubernetes.driverEnv.lakesoul_home=lakesoul.properties --files /path/to/lakesoul.properties to the spark-submit command.
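Putting the pieces together, a Yarn cluster-mode submission might look like the sketch below. The main class com.example.MyLakeSoulJob and the jar my-job.jar are placeholders for your own application:

```shell
# Yarn cluster mode: --files ships lakesoul.properties into the container's
# working directory, which is why the env var value is a bare file name.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --jars lakesoul-spark-2.2.0-spark-3.3.jar \
  --conf spark.yarn.appMasterEnv.lakesoul_home=lakesoul.properties \
  --files /path/to/lakesoul.properties \
  --class com.example.MyLakeSoulJob \
  my-job.jar
```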